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Abstract 


Usernames are ubiquitous on the Internet, 
and they are often suggestive of user de¬ 
mographics. This work looks at the de¬ 
gree to which gender and language can 
be inferred from a username alone by 
making use of unsupervised morphology 
induction to decompose usernames into 
sub-units. Experimental results on the 
two tasks demonstrate the effectiveness of 
the proposed morphological features com¬ 
pared to a character n-gram baseline. 


1 Introduction 


There is much interest in automatic recognition 
of demographic information of Internet users to 
improve the quality of online interactions. Re¬ 
searchers have looked into identifying a variety 
of factors about users, including age, gender, lan¬ 
guage, religious beliefs and political views. Most 
work leverages multiple sources of information, 
such as search query history, Twitter feeds. Face- 
book likes, social network links, and user profiles. 
However, in many situations, little of this infor¬ 
mation is available. Conversely, usernames are al¬ 
most always available. 

In this work, we look specifically at classi¬ 
fying gender and language based only on the 
username. Prior work by sociologists has es¬ 
tablished a link between usernames and gen¬ 


der ( |Cornetto and Nowak, 2006] ), and studies have 
linked usernames to other attributes, such as in¬ 
dividual beliefs dCrabill, 2007 [ |Hassa, 2012| | and 
shown how usernames shape perceptions of gen¬ 
der and ethnicity in the absence of common non¬ 
verbal cues dPelletier, 2014| ). The connections 
to ethnicity motivate the exploration of language 
identification. 

Gender identification based on given names is 
very effective for English dEiu and Ruths, 20T3] ), 


since many names are strongly associated with a 
particular gender, like “Emily” or “Mark”. Unfor¬ 
tunately, the requirement that each username be 
unique precludes use of given names alone. In¬ 
stead, usernames are typically a combination of 
component words, names and numbers. For ex¬ 
ample, the Twitter name @taylorswiftl3 might 
decompose into “taylor”, “swift” and “13”. The 
sub-units carry meaning and, importantly, they are 
shared with many other individuals. Thus, our ap¬ 
proach is to leverage automatic decomposition of 
usernames into sub-units for use in classification. 

We use the Morfessor algorithm 
( Creutz and Eagus, 2006} Virpioja et al., 201^ 
for unsupervised morphology induction to learn 
the decomposition of the usernames into sub¬ 
units. Morfessor has been used successfully 
in a variety of language modeling frameworks 
applied to a number of languages, particularly for 
learning concatenative morphological structure. 
The usernames that we analyze are a good match 
to the Morfessor framework, which allows us to 
push the boundary of how much can be done with 
only a username. 

The classifier design is described in the next 
section, followed by a description of experiments 
on gender and language recognition that demon¬ 
strate the utility of morph-based features com¬ 
pared to character n-gram features. The paper 
closes with a discussion of related work and a 
summary of key findings. 


2 General Classification Approach 

2.1 Unsupervised Morphology Learning 

In linguistics, a morpheme is the “minimal lin¬ 
guistic unit with lexical or grammatical meaning” 
( Booij, 2012[ ). Morphemes are combined in vari¬ 
ous ways to create longer words. Similarly, user- 
names are frequently made up of a concatenated 
sequence of smaller units. These sub-units will be 





















referred to as u-morphs to highlight the fact that 
they play an analogous role to morphemes but for 
purposes of encoding usernames rather than stan¬ 
dard words in a language. The u-morphs are sub¬ 
units that are small enough to be shared across dif¬ 
ferent usernames but retain some meaning. 

Unsupervised morphology induction using 
Morfessor (Creutz and Lagus, 20061 is based on 
a minimum description length (MDL) objective, 
which balances two competing goals: maximizing 
both the likelihood of the data and of the model. 
The likelihood of the data is maximized by longer 
tokens and a bigger lexicon whereas the likeli¬ 
hood of the model is maximized by a smaller lexi¬ 
con with shorter tokens. A parameter controls the 
trade-off between the two parts of the objective 
function, which alters the average u-morph length. 
We tune this parameter on held-out data to opti¬ 
mize the classification performance of the demo¬ 
graphic tasks. 

Maximizing the Morfessor objective exactly is 
computationally intractable. The Morfessor algo¬ 
rithm searches for the optimal lexicon using an 
iterative approach. First, the highest probability 
decomposition for each training token is found 
given the current model. Then, the model is up¬ 
dated with the counts of the u-morphs. A u- 
morph is added to the lexicon when it increases 
the weighted likelihood of the data by more than 
the cost of increasing the size of the lexicon. 

Usernames can be mixed-case, e.g. “JohnDoe”. 
The case change gives information about a likely 
u-morph boundary, but at the cost of doubling the 
size of the character set. To more effectively lever¬ 
age this cue, all characters are made lowercase but 
each change from lower to uppercase is marked 
with a special token, e.g. “john$doe”. Using this 
encoding reduces the u-morph inventory size, and 
we found it to give slightly better results in lan¬ 
guage identification. 

Character 3-grams and 4-grams are used as 
baseline features. Before extracting the n-grams 
a “#” token is placed at the start and end of each 
username. The n-grams are overlapping to give 
them the best chance of finding a semantically 
meaningful sub-unit. 


2.2 Classifier Design 

Given a decomposition of the username into a se¬ 
quence of u-morphs (or character n-grams), we 
represent the relationship between the observed 


features and each class with a unigram language 
model. If a username u has decomposition 
772-1) • • •) w-n then it is assigned to the class q for 
which the unigram model gives it the highest pos¬ 
terior probability, or equivalently: 

n 

argmaxj pc(ci) P(”^fclcj), 
k=l 


where pc{ci) is the class prior and p{mk\ci) is the 
class-dependent unigramHJ 

For some demographics, the class prior can be 
very skewed, as in the case of language detec¬ 
tion where English is the dominant language. The 
choice of smoothing algorithm can be important in 
such cases, since minority classes have much less 
training data for estimating the language model 
and benefit from having more probability mass as¬ 
signed to unseen words. Here, we follow the ap¬ 
proach proposed in ( [Frank and Bouckaert, 2006) 
that normalizes the token count vectors for each 
class to have the same Li norm, specifically: 


Pimk\ci) 


J_ P • n{mk,Ci) \ 
n{cP )’ 


where n(-) indicates counts and /3 controls the 
strength of the smoothing. Setting /3 equal to 
the number of training examples approximately 
matches the strength of the smoothing to the add- 
one-smoothing algorithm. Z = /3 -)- |M| is a con¬ 
stant to make the probabilities sum to one. 

Only a small portion of usernames on the In¬ 
ternet come with gender labels. In these situa¬ 
tions, semi-supervised learning algorithms can use 
the unlabeled data to improve the performance of 
the classifier. We use a self-training expectation- 
maximization (EM) algorithm similar to that de¬ 
scribed in (Nigam et ah, 2000l. The algorithm 
first learns a classifier on the labeled data. In the 
E-step, the classifier assigns probabilistic labels to 
the unlabeled data. In the M-step, the labeled data 
and the probabilistic labels are combined to learn 


'Note that the unigram model used here, which consid¬ 
ers only the observed u-morphs or n-grams, is not the same 
as using a Naive Bayes (NB) classifier based on a vector 
of u-morph counts. In the former, unobserved u-morphs do 
not impact the class-dependent probability, whereas the zero 
counts do impact the probability for the NB classifier. Since 
the vast majority of possible u-morphs are unobserved in a 
username, it is better to base the decision only on the ob¬ 
served u-morphs. The n-gram model is actually a unigram 
with an n-gram “vocabulary” rather than an n-gram language 
model. 









a new classifier. These steps are iterated until con¬ 
vergence, which usually requires three iterations 
for our tasks. 

Nigam et al. (120001) call their method EM-A be¬ 
cause it uses a parameter A to reduce the weight 
of the unlabeled examples relative to the labeled 
data. This is important because the independence 
assumptions of the unigram model lead to over¬ 
confident predictions. We used another method 
that directly corrects the estimated posterior prob¬ 
abilities. Using a small validation set, we binned 
the probability estimates and calculated the true 
class probability for each bin. The EM algorithm 
used the corrected probabilities for each bin for 
the unlabeled data during the maximization step. 
Samples with a prediction confidence of less than 
60% are not used for training. 

3 Experiments 

3.1 Gender Identification 

Data was collected from the OkCupid dating site 
by downloading up to 1,000 profiles from 27 cities 
in the United States, first for men seeking women 
and again for women seeking men to obtain a bal¬ 
anced set of 44,000 usernames. The data is parti¬ 
tioned into three sets with 80% assigned to train¬ 
ing and 10% each to validation and test. We also 
use 3.5M usernames from the photo messaging 
app Snapchat dMcCormick, 2014| ): 1.5M are used 
for u-morph learning and 2M are for self-training. 
All names in this task used only lower case, due to 
the nature of the available data. 


Male 

Female 

, guy, mike, 
u-morph . , 

matt, josh 

girl, marie, 
lady, miss 

. guy, uy#, 

trigram r ■ 

kev, joe 

irl, gir, 
grl, emm 


Table 1: Top Gender Identification Eeatures 

The top features ranked by likelihood ratios 
are given in Table [T] The u-morphs clearly carry 
semantic meaning, and the trigram features ap¬ 
pear to be substrings of the top u-morph fea¬ 
tures. The trigram features have an advantage 
when the u-morphs are under-segmented such as 
if the u-morph “niceguy” or “thatguy” is included 
in the lexicon. Conversely, the n-grams can suf¬ 
fer from over-segmentation. Eor example, the tri¬ 
gram “guy” is inside the surname “Nguyen” even 
though it is better to ignore that substring in this 


context. Many other tokens suffer from this prob¬ 
lem, e.g. “miss” is in “mission”. 

The variable-length u-morphs are longer on av¬ 
erage than the character n-grams (4.9 characters). 
The u-morph inventory size is similar to that for 
3-grams but 5-10 times smaller than the 4-gram 
inventory, depending on the amount of data used 
since the inventory is expanded in semi-supervised 
training. By using the MDE criterion in unsuper¬ 
vised morphology learning, the u-morphs provide 
a more efficient representation of usernames than 
n-grams and make it easier to control the trade¬ 
off between vocabulary size and average segment 
length. The smaller inventory is less sensitive to 
sparse data in language model training. 


Eeatures 

Error Rate 

Supervised 

Self-Training 

3-gram 

28.7% 

32.0% 

4-gram 

28.7% 

29.4% 

u-morph 

27.8% 

25.8% 


Table 2: Gender Classification Results 


The experiment results are presented in Table |2] 
Eor the supervised learning method, the character 
3-gram and 4-gram features give equivalent per¬ 
formance, and the u-morph features give the low¬ 
est error rate by a small amount (3% relative). 
More significantly, the character n-gram systems 
do not benefit from semi-supervised learning, but 
the u-morph features do. The semi-supervised 
u-morph features obtain an error rate of 25.8%, 
which represents a 10% relative reduction over the 
baseline character n-gram results. 


3.2 Language Identification on Twitter 

This experiment takes usernames from the Twitter 
streaming API. Each username is associated with 
a tweet, for which the Twitter API identifies a lan¬ 
guage. The language labels are noisy, so we re¬ 
move approximately 35% of the tweets where the 
Twitter API does not agree with the langid.py clas¬ 
sifier dEui and Baldwin, 2012| ). Both training and 
test sets are restricted to the nine languages that 
comprise at least 1% of the training set. These 
languages cover 96% of the observed tweets (see 
TablelH). About 110,000 usernames were reserved 
for testing and 430,000 were used for training 
both u-morphs and the classifier. Semi-supervised 
methods are not used because of the abundant la¬ 
beled data. Eor each language, we train a one-vs.- 
all classifier. The mixed case encoding technique 














Model 

Free. 

Recall 

FI 

4-gram 

.67 

.75 

.70 

u-morph 

.77 

.67 

.71 

Combination 

.73 

.73 

.73 


Table 3: Precision, recall and FI score for lan¬ 
guage identification using the 4-gram, u-morph 
representations and a combination system, averag¬ 
ing over all users. 

(see sec. 12.Ill gives a small increase (0.5%) in the 
accuracy of the model and reduces the u-morph 
model size by 5%. 

The results in Tables [3] and |4] contrast sys¬ 
tems using 4-grams, u-morphs, and a combina¬ 
tion model, showing precision-recall trade-offs for 
all users together and FI scores broken down by 
specific languages, respecfively. The combina¬ 
tion system simply uses the average of the pos¬ 
terior log-probabilities for each class giving equal 
weight to each model. While the overall FI scores 
are similar for the 4-gram and u-morph systems, 
their precision and recall trade-offs are quite dif¬ 
ferent, making them effective in combination. The 
4-gram system has higher recall, and the u-morph 
system has higher precision. With the combina¬ 
tion, we obtain a substantial gain in precision over 
the 4-gram system with a modest loss in recall, re¬ 
sulting in a 3% absolute improvement in average 
FI score. 

Looking at performance on the different lan¬ 
guages, we find that the FI score for the combi¬ 
nation model is higher than the 4-gram for every 
language, with precision always improving. For 
the dominant languages, the difference in recall 
is negligible. The infrequent languages have a 4- 
8% drop in recall, but the gains in precision are 
substantial for these languages, ranging from 50- 
100% relative. The greatest contrast between the 
4-gram and the combination system can be seen 
for the least frequent languages, i.e. the languages 
with the least amount of training data. In partic¬ 
ular, for French, the precision of the combination 
system (0.36) is double that of the 4-gram model 
(0.18) with only a 34% loss in recall (0.24 to 0.16). 

Looking at the most important features from 
the classifier highlights the ability of the mor¬ 
phemes to capture relevant meaning. The pres¬ 
ence of the morpheme “juan”, “jose” or “flor” in¬ 
crease the probability of a Spanish language tweet 
by hve times. The same is true for Portuguese 


Language 

Freq 

4-gr 

u-m 

Comb 

English 

43.5 

.78 

.78 

.79 

Japanese 

21.6 

.75 

.76 

.77 

Spanish 

15.1 

.66 

.65 

.68 

Arabic 

7.9 

.66 

.65 

.68 

Portuguese 

6.2 

.50 

.55 

.58 

Russian 

1.8 

.40 

.58 

.45 

Turkish 

1.8 

.59 

.36 

.65 

Erench 

1.1 

.18 

.13 

.22 

Indonesian 

1.1 

.34 

.30 

.43 


Table 4: Language identification performance (F1 
Scores) and relative frequency in the corpus for 
4-gram (4-gr) and u-morph (u-m) representations 
and the combination system (Comb). 

and the morpheme “bieber”. The morpheme “q8” 
increases the odds of an Arabic language tweet 
by thirteen times due to its phonetic similarity to 
the name of the Arabic speaking country Kuwait. 
Other features may simply reflect cultural norms. 
For example, having an underscore in the user- 
name makes it five percent less likely to observe an 
English tweet. These highly discriminative mor¬ 
phemes are both long and short. It is hard for the 
fixed-length n-grams to capture this information as 
well as the morphemes do. 

4 Related Work 

Of the many studies on automatic classification 
of online user demographics, few have leveraged 
names or usernames at all, and the few that do 
mainly explore their use in combination with other 
features. The work presented here differs in its use 
of usernames alone, but more importantly in the 
introduction of morphological analysis to handle a 
large number of usernames. 

Two studies on gender recognition are particu¬ 
larly relevant. Burger et al. (120111) use the Twitter 
username (or screen name) in combination with 
other profile and text features to predict gender, 
but they also look at the use of username features 
alone. The results are not directly comparable to 
ours, because of differences in the data set used 
(150k Twitter users) and the classifier framework 
(Winnow), but the character n-gram performance 
is similar to ours (21-22% different from the ma¬ 
jority baseline). The study uses over 400k char¬ 
acter n-grams (n=l-5) for screen names alone; our 
study indicatess that the u-morphs can reduce this 
number by a factor of 10. Burger et al. (120111) 










used the same strategy with the self-identified full 
name of the user as entered into their profile, 
obfaining 89% gender recognition (vs. 77% for 
screen names). Later, Liu and Rufhs (I2013II use 
fhe full firsf name from a user’s profile for gender 
defection, finding fhaf for fhe names fhaf are highly 
predicfive of gender, performance improves by re¬ 
lying on fhis feafure alone. However, more fhan 
half of fhe users have a name fhaf has an unknown 
gender associafion. Manual inspecfion of fhese 
cases indicated fhaf fhe majorify includes sfrings 
formed like usernames, nicknames or ofher fypes 
of word concatenations. These examples are pre¬ 
cisely whaf fhe u-morph approach fries fo address. 

Language idenfificafion is an acfive 
area of research (Bergsma ef ah, 2012 
Zubiaga et ah, 2014), buf fhe username has 
nof been used as a feafure. Again, resulfs are 
difficulf fo compare due fo fhe lack of a common 
fesf sef, buf if is nofable fhaf fhe average FI score 
for fhe combination model approaches fhe scores 
obfained on a similar Twiffer language identifi¬ 
cation fask where fhe algorifhm has access fo fhe 
full fexf of fhe fweef ( |Lui and Baldwin, 2014| |: 
73% vs. 77% . 

A sfudy fhaf is potentially relevanf fo our work 
is automafic classificafion of efhnicify of Twiffer 
users, specifically whefher a user is African- 
American (iPennacchioffi and Popescu, 20111). 


Again, a variefy of confenf, profile and behavioral 
feafures are used. Orthographic feafures of fhe 
username are used (e.g. length, number of nu¬ 
meric/alpha characters), and names of users that 
a person retweets or replies to. The profile name 
feafures do nof appear fo be useful, but examples 
of related usernames point to the utility of our 
approach for analysis of names in other fields. 


5 Conclusions 


cation is particularly remarkable because if comes 
close fo mafching fhe performance achieved by us¬ 
ing fhe full fexf of a fweef. The work is com- 
plemenfary fo ofher demographic sfudies in fhaf 
fhe username prediction can be used fogefher wifh 
ofher feafures, bofh for fhe user and members of 
his/her social nefwork. 

The mefhods proposed here could be extended 
in differenf direcfions. The unsupervised morphol¬ 
ogy learning algorifhm could incorporate priors 
relafed fo capifalizafion and non-alphabefic char¬ 
acters fo better model fhese phenomena fhan our 
simple fexf normalizafion approach. More so¬ 
phisticated classifiers could also be used, such as 
variable-lengfh n-grams or neural-nefwork-based 
n-gram language models, as opposed fo fhe uni¬ 
gram model used here. Of course fhe sophisfica- 
fion of fhe classifier will be limited by fhe amounf 
of fraining dafa available. 

A large amounf of dafa is nof necessary fo build 
a high precision username classifier. For example, 
less fhan 7,000 training examples were available 
for Turkish in the language identification exper¬ 
iment and the classifier had a precision of 76%. 
Since liffle dafa is required, fhere may be many 
more applicafions of fhis fype of model. 

Prior work on unsupervised morphological in¬ 
duction focused on applying fhe algorifhm fo nafu- 
ral language inpuf. By using fhose fechniques wifh 
a new fype of inpuf, fhis paper shows fhaf fhere are 
ofher applicafions of morphology learning. 
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